Writing Assistants Should Model Social Factors of Language
Intelligent writing assistants powered by large language models (LLMs) are
more popular today than ever before, but their further widespread adoption is
precluded by sub-optimal performance. In this position paper, we argue that a
major reason for this sub-optimal performance and adoption is a singular focus
on the information content of language while ignoring its social aspects. We
analyze the different dimensions of these social factors in the context of
writing assistants and propose their incorporation into building smarter, more
effective, and truly personalized writing assistants that would enrich the user
experience and contribute to increased user adoption.
Comment: 2 pages, Accepted to In2Writing Workshop (CHI 2023)
ContraDoc: Understanding Self-Contradictions in Documents with Large Language Models
In recent times, large language models (LLMs) have shown impressive
performance on various document-level tasks such as document classification,
summarization, and question-answering. However, research on understanding their
capabilities on the task of self-contradictions in long documents has been very
limited. In this work, we introduce ContraDoc, the first human-annotated
dataset to study self-contradictions in long documents across multiple domains,
varying document lengths, self-contradictions types, and scope. We then analyze
the current capabilities of four state-of-the-art open-source and commercially
available LLMs: GPT3.5, GPT4, PaLM2, and LLaMAv2 on this dataset. While GPT4
performs the best and can outperform humans on this task, we find that it is
still unreliable and struggles with self-contradictions that require more
nuance and context. We release the dataset and all the code associated with the
experiments.
Privacy- and Utility-Preserving NLP with Anonymized Data: A case study of Pseudonymization
This work investigates the effectiveness of different pseudonymization
techniques, ranging from rule-based substitutions to using pre-trained Large
Language Models (LLMs), on a variety of datasets and models used for two widely
used NLP tasks: text classification and summarization. Our work provides
crucial insights into the gaps between original and anonymized data (focusing
on the pseudonymization technique) and model quality and fosters future
research into higher-quality anonymization techniques to better balance the
trade-offs between data protection and utility preservation. We make our code,
pseudonymized datasets, and downstream models publicly available.
Comment: 10 pages, Accepted for the TrustNLP workshop at ACL 2023
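The rule-based substitution end of the spectrum described in this abstract can be illustrated with a minimal sketch. The function name, the placeholder format, and the two example patterns below are illustrative assumptions, not the paper's actual pipeline: real systems typically combine many more rules with learned named-entity recognizers.

```python
import re

# Minimal sketch of rule-based pseudonymization: replace pattern-matched
# identifiers with consistent, numbered placeholders so that repeated
# mentions of the same entity map to the same pseudonym.
def pseudonymize(text, patterns):
    mapping = {}   # original value -> placeholder
    counter = {}   # per-label running count

    def substitute(match, label):
        value = match.group(0)
        if value not in mapping:
            counter[label] = counter.get(label, 0) + 1
            mapping[value] = f"[{label}_{counter[label]}]"
        return mapping[value]

    for label, pattern in patterns.items():
        text = re.sub(pattern, lambda m, l=label: substitute(m, l), text)
    return text, mapping

# Two illustrative rules; a production system would use far richer detectors.
rules = {
    "EMAIL": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
    "PHONE": r"\b\d{3}-\d{3}-\d{4}\b",
}
anon, mapping = pseudonymize("Mail alice@example.com or call 555-123-4567.", rules)
```

Keeping the `mapping` separate from the anonymized text is what makes the transformation a pseudonymization (reversible by the data holder) rather than full anonymization.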
Read, Revise, Repeat: A System Demonstration for Human-in-the-loop Iterative Text Revision
Revision is an essential part of the human writing process. It tends to be
strategic, adaptive, and, more importantly, iterative in nature. Despite the
success of large language models on text revision tasks, they are limited to
non-iterative, one-shot revisions. Examining and evaluating the capability of
large language models for making continuous revisions and collaborating with
human writers is a critical step towards building effective writing assistants.
In this work, we present a human-in-the-loop iterative text revision system,
Read, Revise, Repeat (R3), which aims at achieving high-quality text revisions
with minimal human effort by reading model-generated revisions and user
feedback, revising documents, and repeating human-machine interactions. In R3,
a text revision model provides text editing suggestions for human writers, who
can accept or reject the suggested edits. The accepted edits are then
incorporated into the model for the next iteration of document revision.
Writers can therefore revise documents iteratively by interacting with the
system and simply accepting/rejecting its suggested edits until the text
revision model stops making further revisions or reaches a predefined maximum
number of revisions. Empirical experiments show that R3 can generate revisions
with acceptance rates comparable to those of human writers at early revision
depths, and that human-machine interaction can yield higher-quality revisions
with fewer iterations and edits. The collected human-model interaction dataset and system
code are available at \url{https://github.com/vipulraheja/IteraTeR}. Our system
demonstration is available at \url{https://youtu.be/lK08tIpEoaE}.
Comment: Accepted by the First Workshop on Intelligent and Interactive Writing Assistants at ACL 2022
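The accept/reject loop described in this abstract can be sketched as pseudocode. The function names `suggest_edits` and `get_user_decisions` below are placeholders for the revision model and the human writer, and the string-replacement edit format is an illustrative assumption, not the R3 system's actual interface:

```python
# Sketch of the iterative human-in-the-loop revision cycle: the model
# proposes edits, the writer accepts or rejects them, accepted edits are
# applied, and the loop repeats until the model stops suggesting changes
# or a maximum revision depth is reached.
def iterative_revision(document, suggest_edits, get_user_decisions, max_revisions=3):
    for depth in range(max_revisions):
        edits = suggest_edits(document)          # model proposes (old, new) edits
        if not edits:                            # model stops making revisions
            break
        accepted = get_user_decisions(edits)     # writer accepts/rejects each edit
        if not accepted:
            break
        for old, new in accepted:                # apply accepted edits
            document = document.replace(old, new)
    return document
```

Feeding only the accepted edits back into the next iteration is what lets the writer steer the revision trajectory without editing the text directly.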
Speakerly: A Voice-based Writing Assistant for Text Composition
We present Speakerly, a new real-time voice-based writing assistance system
that helps users with text composition across various use cases such as emails,
instant messages, and notes. The user can interact with the system through
instructions or dictation, and the system generates a well-formatted and
coherent document. We describe the system architecture and detail how we
address the various challenges while building and deploying such a system at
scale. More specifically, our system uses a combination of small, task-specific
models as well as pre-trained language models for fast and effective text
composition while supporting a variety of input modes for better usability.
Comment: Accepted at EMNLP 2023 Industry Track
GEMv2: Multilingual NLG benchmarking in a single line of code
Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.